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Abstract 

Domain decomposition is a natural route to parallel computing for partial differential equation 
solvers. In this procedure, subdomains of which the original domain of definition is comprised 
are assigned to independent processors at the price of periodic coordination between processors 
to compute global parameters and maintain the requisite degree of continuity of the solution at 
the subdomain interfaces. In the domain-decomposed solution of steady multidimensional sys- 
tems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, 
the only portion of the computation which generally stands in the way of efficient parallelization 
is the solution of the large, sparse linear systems arising at each Newton step. For some Jaco- 
bian matrices drawn from an actual two-dimensional reacting flow problem, we make comparisons 
between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate 
Gradient and Chebyshev type, focusing attention on both iteration count and global inner prod- 
uct count. The generalized minimum residual method with block-ILU preconditioning is judged 
the best serial method among those considered, and parallel numerical experiments on the Encore 
Multimax demonstrate for it approximately 10-fold speedup on 16 processors. The three special 
features of reacting flow models in relation to these linear systems are: the possibly large num- 
ber of degrees of freedom per gridpoint, the dominance of dense infra-point source-term coupling 
over infer- point convective-diffusive coupling throughout significant portions of the flow-field, and 
strong nonlinearities which restrict the time-step to small values (independent of linear algebraic 
considerations) throughout significant portions of the iteration history. Though these features are 
exploited to advantage herein, many aspects of the paper are applicable to the modeling of general 
convective-diffusive systems. 


♦This research was supported by the National Aeronautics and Space Administration under NASA Con- 
tract No. NASl-18107 and by the Air Force Office of Scientific Research under Contract No. AFOSR 88-0117 
while the author was in residence at the Institute for Computer Applications in Science and Engineering 
(ICASE), NASA Langley Research Center, Hampton, VA 23665. 
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1. Introduction 


Computational scientists and engineers porting large-scale codes to most contemporary parallel 
computers are faced directly with the related issues of partitioning programs across processors and 
partitioning data across memories. For the large subset of scientific codes which are based on the 
discretization of partial differential equations in multidimensional domains, domain decomposition 
is a natural basis for the partitioning, at least in the coarse-to-medium granularity range. It is 
“natural” in the sense that both the data structures and the algorithms employed in a serial code 
can be carried over to the parallel case, changed only by the addition of new code to treat the 
artificially introduced boundaries and to mediate computation of global parameters. We shall not 
attempt a quantification of the granularity of the computation in terms of an absolute number of 
processors, since granularity is a relative concept in domain decomposition applications. Typically, 
we would like to require that the subdomains remain “chunky”, in other words, that the number 
of boundary degrees of freedom remain small relative to the number in the subdomain interiors. In 
this respect, the subdomains will resemble the original domain, for which a good solver is presumed 
known. We note in passing that domain decomposition has other virtues apart from creating parallel 
threads of execution, such as operator adaptivity, graceful accommodation of complex geometry, 
and clean memory management of problems too large to fit in active core. Some instances of these 
have been described theoretically and illustratively in the collections [6, 12, 23, 24], 

According to our parallelization paradigm, the domain of definition of the PDE is decom- 
posed into subdomains and a one-to-one mapping of subdomain to processor is created.* For an 
explicit stage of a calculation, of which matrix multiplication is representative, the task of finding 
independent threads of execution ends with the decomposition. Parallelization consists of simply 
subdividing the ranges of DO-loops. For an implicit stage, of which matrix inversion is represen- 
tative, a proven serial algorithm may be employed on each subdomain, but additional provisions 
must be made to enforce the original equations at the subdomain interfaces. In their simplest 
form, the interface conditions can take the form of interpolative updating of data in regions of 
subdomain overlap. Other forms include simultaneous updates of interior and interface data in a 
global preconditioned iterative process and incorporation of the interface constraints in a global 
variational principle. 

There are potentially three penalties to be paid in distributing a solution algorithm over an 
array of independent processors: synchronization overhead, communication overhead, and degra- 
dation of convergence rate. The synchronization penalty arises if the processors have data depen- 
dencies which cause them to idle. This can happen at convergence checkpoints, for instance, or at 
any other point where an exchange of data is required if the processors have unequal amounts of 
work to do. The communication penalty is the time spent physically exchanging messages between 
processors which depend on common data (or granting controlled access to shared memory). This 
can be non-negligible even in perfectly load- balanced parallel configurations. In an effort to keep 
the synchronization and communication overhead down, natural tradeoffs may exist which degrade 
the convergence rate of the algorithm by substituting readily available data for the best possible 
data. These penalties are measured indirectly through the speedup and efficiency figures-of-merit 
of a parallel implementation. A common measure of speedup is the ratio of the uniprocessor execu- 
tion time of a given algorithm to that of the multiprocessor execution of the same algorithm. The 
corresponding measure of efficiency is the speedup divided by the number of processors. Within 
the context of a specific algorithm, efficiency and speedup are functions of problem parameters, 


* Mapping one subdomain to a sub-cluster of processors is also possible, but involves a different class of issues, since space- 
based decompositions lend themselves to lock-step homogeneous programming of the individual processors, while non-space- 
based decompositions might not. In a multigrid context, generalization in the other direction might be natural, namely mapping 
several subdomains from different multigrid levels onto the same processor. 


1 


such as problem size and decomposition granularity, as well as machine parameters, such as the 
ratio of local to remote memory access times or the ratio of floating point arithmetic time to the 
latency period of interprocessor messages. While theoretical computational complexity analyses 
can elucidate some general trends, the demonstration of parallel efficiency inevitably requires some 
experimental information in all but ideal problems. 

The cost-effectiveness of obtaining parallelization through domain decomposition as opposed 
to other methods can depend on the relative weighting of three (to some extent mutually exclusive) 
summands in the cost objective: efficiency, wall-clock execution time, and software development 
time. Though efficiency is always important in the many-processor regime, our main interest is 
in algorithms that perform competitively over a range of granularity which specifically includes 
a single processor. In confining attention to algorithms which are “methods of choice” in serial, 
many which parallelize efficiently are ruled out. One near-term motivation for techniques which 
perform acceptably over a range of granularities is that jobs which use nearly all of the memory 
of a multiprocessor supercomputer, such as the Cray-XMP or the ETA-10, effectively tie up most 
of the processors sharing that memory, whether their cycles are actually employed or not. The 
ability to keep the processing force in rough proportion to memory usage over a range of problem 
sizes can be an economic advantage. Independent of economic considerations, there is obviously 
a premium on the ability to reduce wall-clock time by adding processors in “production mode” 
in many applications. Until the elusive day when native software facilities make parallelization 
automatic, domain decomposition will have a significant role to play in meeting these objectives. 

The outline of this case study of the applicability of domain decomposition to the computation 
of reacting flows (which must be regarded as merely preliminary in several respects to be noted) is 
as follows. Section 2 describes a reacting flow model with features that stretch the supercomputer 
state-of-the-art, and also the relation to it of an approximate reduced model known as the “flame 
sheet”, to which numerical consideration is confined in the sequel. A numerical discretization and a 
solution algorithm are presented in section 3. Section 4 provides a brief complexity analysis for the 
domain-decomposed parallel implementation and shows the pivotal role in the parallel efficiency 
of the sparse linear solves required by Newton’s method. Section 5 discusses alternative linear 
algebraic methods and presents some serial results for sample Jacobians drawn from flame sheet 
simulations. Parallelization issues for the linear algebra itself are considered in section 6, and some 
numerical results obtained for one of the methods over a range of 1 to 16 processors on the Encore 

Multimax, a shared memory computer. In section 7 we draw some conclusions and state some open 
research questions. 

2. Mathematical Modeling of Reacting Flows 

The goal of this section is to convey the structure and dimension of the reacting flow system, 
and not todwell on the underlying physics. Our attention is confined to two space dimensions’ 
because a “third dimension”, namely, the number of components defined at each point, already 
appears in the full model, and because many combustion applications are well approximated by 
axisymmetry. Under this restriction, it is natural to adopt a streamfunction-vorticity formulation 
of the fluid dynamics rather than a velocity-pressure formulation. The number of unknowns is 
thereby reduced by one, and more conveniently, each PDE in the streamfunction/vorticity system 
is formally of elliptic type. 

For definiteness, we cite without derivation the following set of elliptic boundary value problems 
(see, e.g., [13]). Though typical of many reacting flow models, they do not include effects which 
will in some contexts be essential (e.g., radiation, turbulence), and which could alter the algebraic 
connectivity of the discrete governing system. The problem motivating the present work is an 
axisymmetric isobaric laminar flame, such as the gaseous co-flowing jet flame sketched in Fig. 1. 
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Figure Is Schematic of a confined non-premixed laminar jet 
flame. 

Let r and z denote the radial and axial directions, v r and v z the respective velocity components, 
and p the density. We introduce the variable density Stokes streamfunction such that 


9 i> A 

prv T = and prv z 

dz 


d'lf) 

dr 1 


and the vorticity 

dv r dv z 
U dz dr 

The species concentrations by mass are Y*, for k = 1,2, . . K y where K is the number of distinct 
chemical species in the reaction mechanism. The temperature is T. These principal fields satisfy 
the following equations. 

Streamfunction (overall mass conservation): 


dz \rp dz J dr \rp dr ) 


( 2 . 1 ) 


Vorticity (momentum conservation): 



( 2 . 2 ) 


Species (detailed mass conservation): 


d_ 

dz 



+ §- r (rpY k V kr ) + j- z (rpY k V kz ) 


—rW k w k = 0, k = l,2,...,K. 


( 2 . 3 ) 
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Temperature (internal energy conservation): 



t'EU 1 '* + = 0. (2.4) 

fc=l k v ft *=1 

Other parameters appearing in this system are: the viscosity of the mixture /x, the acceleration 
of gravity g , the diffusion velocities of the k th species in the mixture (V kr1 V kx ), the specific heat of 
the k species c pk , the specific heat of the mixture c p , the thermal conductivity of the mixture A, 
the molecular mass of the k ih species W kj and the generation/consumption rate of the k th species 
w k . 

The system is closed with the multicomponent ideal gas law, 


P = 


pW 

RT ’ 


(2.5) 


where W is the mean molecular mass of the local mixture, p the pressure, and R the universal gas 
constant. The Arrhenius reaction rate law for a system of J reactions gives 


J 


- if-k) 
j=i 



where the forward rate coefficient of reaction j is given by 


( 2 . 6 ) 


kj = AjT^expi-Ej/RT), ( 2 . 7 ) 

and where, in turn, Aj, the pre-exponential rate constants, /?,, the temperature exponents, Ej, 
the activation energies, and K], the equilibrium constants, are assumed known. The v jk are the 
stoichiometric coefficients of the k th species in the j th reaction. Superscripts ' and " on the Uj k denote 
the reactant and product sides of the stoichiometric equation, respectively; thus w k is positive where 
species A: is produced. Thermodynamic and constitutive equations provide the temperature- and 
composition-dependent specific heats, viscosities, thermal conductivities, and diffusion velocities. 
The matrix of stoichiometric coefficients is typically very sparse, with each of the J elementary 
reactions involving only a few of the K species. For hydrogen or lower hydrocarbon flames in air 
J and K are 0(10* — 10 2 ). 

Though formally elliptic, the presence of strong convection renders all of the governing equa- 
tions except that of the streamfunction approximately parabolic in z. 

In most practical combustors, the reactants are supplied in a non-premixed state in streams 
from different reservoirs, and the rate of combustion is limited by the diffusive mixing of the 
reactants rather than chemical reaction rates. The flame sheet model exploits this large Damkohler 
number (ratio of diffusion to reaction time scales) by replacing a reaction zone of finite thickness 
with an interface (of unknown location) across which gradients of the temperature and species 
are discontinuous.* The interface subdivides the physical domain SI into an oxidizer-free zone Sip 
and a fuel- free zone Slo, in either of which the full composition and thermodynamic state of the 
gas mixture can be recovered from a single conserved scalar. The flame sheet is an economical 
means of computing initial iterates for detailed kinetics calculations [16], since it requires just 

*Our version of the flame sheet includes some additional simplifying assumptions; see [16] for details. 
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three components per gridpoint (the variable density Stokes streamfunction ip, the vorticity u>, 
and a conserved scalar 5) instead of the dozens that would be required with full chemistry. The 
equations which replace (2.3)-(2.7) are: 

Conserved Scalar (Energy and Mass Conservation): 



Algebraic State Relations: 


p{x) 


p{x) 


D(x) 


{ 

{ 

{ 


p F (S(x)), 

x E SIf 1 

(2.9) 

po(s{x)), 

x E flo J ’ 

p F (S(x)), 

x E JIf 1 

(2.10) 

Po(S(x)), 

x E f lo J ’ 

Df(S(x)), 

x e Of 1 

(2.11) 

Do(S(x)), 

x E flo J ’ 


The fields of T and the Yk may also be obtained from 5 and are convenient intermediates in 
the evaluation of the right-hand sides of (2.9)-(2.11). Though the flame sheet is a more compact 
problem in terms of the number of unknowns, the spatial density distribution, with sharp gradients 
leading up to a gradient discontinuity at the flame front, carries the difficult nonlinearities of the 
original system. The ratio of maximum to minimum density in the flow field is typically O(IC) 1 ), 
and the ratios of the extreme viscosity and mass diffusion coefficients, fi in (2.2) and pD in (2.8), 
are on the order of the square root of the extreme density ratio. Fig. 2 displays the converged 
flame sheet density profile for the problem studied in sections 5 and 6, and the grid on which it was 
obtained. Note the severe gradients near the fuel/air inlet boundary at r = 0.2 cm. For plotting 
clarity, the true axial-to-radial aspect ratio of 8:1 has been distorted. 

As an illustration of the quality of the flame sheet results for the temperature distribution, we 
display Fig. 3 from [16] which compares a flame sheet solution on a one-dimensional 38-point grid 
to a detailed kinetics solution on an adaptive 65-point grid. However, the flame sheet results have 
little information to offer concerning the chemical structure of the reaction zone, because of the 
restrictive assumptions employed, especially that the product of the fuel and oxidizer concentrations 
is everywhere zero. 

3. The Discrete Governing System and Its Solution 

The governing equations, along with appropriate boundary conditions, are differenced on a 
two-dimensional tensor product grid (but see also [30] for local adaptive gridding results) which is 
generated adaptively from an initial coarse grid by subequidistribution of gradients and curvatures 
of the solution components. This concentrates grid points in the regions of high-activity (fronts and 
peaks) in the domain. Second-order differences are used throughout except for gradient boundary 
conditions, and for the convective terms of the conservation equations in which first-order upwind 
differences are employed. This discretization can be accommodated within the standard nine-point 
stencil (the corner points being necessitated by the mixed derivatives in the convective terms) and, 
apart from the source term, insures the diagonal dominance of the Jacobian. The evaluation of 
the governing equation residuals at a given solution iterate constitutes a significant expense, so 
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Figure 2: Density contours in a co- flowing non- premixed 
methane-air flame, computed in the flame sheet limit. Except 
for the spatial variability of mixture molecular weight W, the 
temperature field is inverse to this through (2.5), and thus 
has similar structure. The hottest region (and the density 
trough) lies near the fuel inlet. The grid is shown on the 
right. 


methods which make efficient use of the function evaluations are required. For such problems, 
robust variations of Newton’s method are often preferable to less fully coupled iterative methods 
or associated explicit time-marching methods (see, e.g., [28]). 

We write the overall system in the form 

F{4>) = 0, (3.1) 

where 4> represents a column vector of all of the unknowns. Equation (3.1) may be solved efficiently 
by a damped modified Newton method provided that an initial iterate <£<°> sufficiently close to the 
solution <f>* is supplied. The damped modified Newton iteration is given by 

*<*+*> = + A <*>**<*>, 

where 
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(3.3) 



Figure 3: Comparison between one- dimensional tempera- 
ture profiles computed from flame sheet and a detailed kinet- 
ics formulations in a non-premixed methane-air flame. 

where the matrix J<‘> is an approximation to the actual Jacobian matrix equated at the i*‘ 
iterate. We refer to «#> as the k' h update. When A<‘> = 1 and = J< > = IjW 1 >), for all k, a 
pure Newton method is obtained. The iteration terminates when some (possibly scaled) 2-norm of 
S 4 >W drops below a given tolerance. In well-conditioned systems, this will, of course, also be true 

of the norm of F(<j > ^). . , , , , 

A sufficiently good initial condition for Newton’s method is usually obtainable by an elementary 

continuation procedure, namely driving a pseudo-transient form of (3.1), 


D^ + F (4>) = 0, 


(3.4) 


where D is a scaling matrix, part of the way towards its steady state. If (3.4) is implicitly time- 
differenced with the backward Euler method using time step A t from a given initial state <j>, a new 

system, 


= D 


_ n i+-4) 


At 


+ F{4>) = 0, 


(3.5) 


results, which possesses the Jacobian 


D_ dF_ 
J= At + dcj>' 


With an appropriate D and for a succession of a sufficiently small At, which will, however, in 
pursuit of efficiency, be larger than what would be required for accurate resolution of the physical 
transient, the iteration of (3.5) is a numerically robust procedure for generating an iterate from 
which Newton’s method will converge. Moreover, due to the close relationship of (3.5) to (3.1), only 
minor modifications of the steady-state code for (3.1) are required to make it a transient /steady- 
state hybrid. Adaptive control of the size of the time steps used in (3.5) can enhance the convergence 
rate of the hybrid algorithm. One useful strategy is to choose At based on a loose temporal 
truncation error bound on the most rapidly varying component of the solution. As 4> approaches 
<f>* and At becomes large, this effects a smooth transition to the (infinite time step) steady-state 

formulation. 
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Fl * U " 4: A 8chematic of the Jacobian matrix of the 9- 

point finite-difference operator showing its sparsity structure, 
(a) upper left: a two-dimensional 5x6 grid with the sten- 
cil corresponding to (i,j) = (3,3) highlighted, (b) upper 
right: gross sparsity structure of the Jacobian, n 2 blocks of 
size mr x mr, representing the coupling between rows, (c) 
lower left: an enlargement of the cross-hatched block, m 2 
blocks of size r x r, representing the coupling between points 
within a row. (d) lower right: an enlargement of the double- 
cross-hatched block, r 2 blocks of size 1 x 1, representing the 
coupling between degrees of freedom at a point. 
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The complexity of the governing equations in either formulation above precludes analytic ex- 
pression of the elements of the Jacobian; therefore a finite-difference approximation to the Jacobian 
must be used. Extending the sparse Jacobian evaluation ideas of Curtis, Powell and Reid [9J, as 
described in more detail elsewhere [29], we can form all of the nonzero elements from (9r + 1) 
independent vector residual evaluations. This is still a large number of function evaluations, with 
the result that coefficient generation becomes at least a co-dominant term in running time of the 

The method described in this section has been successfully applied to a laminar methane-air 
jet in serial by means of a three-pass solution method, and results have appeared in [30], In the 
first pass, which was carried out on a relatively coarse grid, the flame sheet model was used to 
generate hydrodynamic and temperature profiles. In the second pass, the temperature profile was 
maintained at the flame sheet solution, and the full system was solved with the energy equation 
(2.4) omitted. Finally, the energy equation was readmitted, the temperature profile allowed to 
adjust, and the full system solved on several successively finer grids, selected adaptively to resolve 
the high activity regions of the flame. This procedure, which is similar to that found invaluable in 
the one-dimensional context [16], promotes convergence of the full nonlinear iterations by gradually 
phasing in the details of the physical model, so that a good starting estimate exists for each pass, 
and promotes efficiency by keeping the size of the discrete system relatively small until the endgame. 
Even so, the equivalent of approximately 10 Cray-XMP hours is required for the complete effort. 

Each of the three phases employs Newton’s method as described above. From (3.2)-(3.3) and 
the paragraph following, we identify for further analysis the five basic tasks which together account 
for almost all of the execution time required by the code: (1) DAXPY vector arithmetic, (2) the 
evaluation of norms, (3) the evaluation of residual vectors, (4) the evaluation of Jacobians, and (5) 
the solution of linear equations involving the Jacobian matrix. 

4. Complexity Analysis for a Domain-Decomposed Solver 

A complexity analysis of a fairly involved computational procedure is most clearly approached 
in a modular fashion. Accordingly, we begin by constructing operation counts for the five major 
computational subtasks listed at the end of the previous section, then combine them in proportion. 
We restrict our analysis here to strip- wise decompositions and to leading-order effects. A strip- wise 
decomposition with radially oriented strips is natural for a domain consisting of a symmetry plane of 
the domain sketched in Fig. 1 because the high aspect ratio will lead to tight radial coupling in the 
diffusive terms of the governing equations, whereas the axial coupling is dominated by convection, 
and hence predominantly unidirectional. 

We begin with a model for the computer itself. A network of p homogeneous processor ele- 
ments is assumed, each of which has sufficient local memory or a devoted share of global memory 
to represent the data of the associated subdomain. We define 7 as the unit of scalar floating 
point operation time. Any vector-processing capability of these elements is ignored, although we 
acknowledge that the potential savings of vectorization are considerable in this context [ 11 ]. 

The processors communicate by passing messages or accessing shared memory through a global 
bus, depending upon the interconnection network. We define a as the time overhead to route a 
message of any length between a pair of neighboring processors or enforce a “lock” on a shared 
memory location, and (3 as the additional transfer time per floating point word, which is simply the 
reciprocal of the bus bandwidth for access to shared memory. For simplicity, we assume that only 
sending messages (or depositing data in the shared memory) takes time. Receiving (or reading data 
is assumed to be free. Since all sent messages are presumed received at some point, any additional 
reception cost which ought to be included to model a particular machine can be incorporated by 
simple readjustment of a and (3 ( e.g ., doubling them if the two costs are comparable). 
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We define two architecture-dependent functions of p: Ce, the local exchange coefficient between 
processors sharing neighboring sides, and Cr, the global reduction coefficient. The global reduction 
operation produces a single scalar from scalars defined on each processor, whether by addition in 
the case of a global inner product whose partial sums are locally accumulated, or by the logical 
“and” operation in the case of a convergence test. These coefficients which multiply terms in a and 
(3 take into account the possibly degraded ability of the network to exchange data as the number 
of processors grows. In the case of the ring of processors or a hypercube, Ce is unity regardless 
of the number of processors because there are channels solely devoted to these nearest- neighbor 
message routes in the network. In the case of the shared-memory machine, we make for simplicity 
the extreme assumption that the bandwidth to memory for each processor must be divided by the 
number of processors; hence Ce is p itself. For a ring or a hypercube, Cr is twice the diameter of 
the network, since a partial sum must be collected, then redistributed. Hence Cr is p or 21ogp, 
respectively. Cr is 2 p for a shared-memory machine. 

The complexity model depends in addition on basic discrete problem dimensions: the resolution 
of the grid n (assumed the same in each direction for simplicity), and the number of components 
per gridpoint r. We also require an arithmetic operation count coefficient C A , which depends on the 
details of the physics and chemistry incorporated into the governing equations and the number of 
points in the discrete stencil. The product of C A with n 2 r gives the number of arithmetic operations 
per residual vector evaluation. In terms of these parameters the major algorithmic subtasks have 
the following complexities. 

DAXPYs: 



The DAXPY requires no communication and parallelizes perfectly over any partitioning. 
Global Norms: 



7 + [C R ]a + [C fl ]/J. 


The norm requires local inner products followed by a global reduction of the scalar partial sums 
Residual Vector Evaluation: 



7 -I- [2 Ce\ a + [2 nrCsl (3. 


To satisfy the data requirements of the nine-point finite difference stencils on the artificial bound- 
aries of the subdomains, each strip subdomain must exchange its two artificial boundary rows with 
its neighbors. 

Jacobian Evaluation: 


J ((9r+l)CUr+18r 2 )J 7+ 

[(2(9r + 1) + 6 )Ce] a + [(2(9r -f l)nr + 6nr 2 )C£;] (3. 

T he J j C ° bian evaJuatlon cost includes (9r + 1) independent function evaluations, along with some 
finite difference arithmetic to generate the matrix elements. The communication terms include 
both the sharing of subdomain boundary data needed for the function evaluation and the exchange 
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of 6n r x r Jacobian blocks generated on adjacent processors using the grid-coloring described in 
[29]. 

For a base-line linear algebraic method, we employ concurrent block-line relaxation, sweeping 
in the downstream axial direction within each subdomain to follow the convection. In the limit of 
one processor, this is the block-line Gauss-Seidel method; in the limit of n processors, so that the 
each subdomain consists of only one row, this is the block-line Jacobi method. Treating only the 
three central diagonal blocks implicitly, each row of the sweep requires the formation of a modified 
right-hand side which includes a term in r 2 from each of the other six Jacobian blocks in the same 
row, followed by a block tridiagonal solve. The factorization of the block tridiagonal system need 
be performed only once per Jacobian evaluation. The assemblage of the right-hand sides (which 
includes the exchange of the rows lining the artificial boundaries) and the back-substitution is 
required on each sweep, as is a convergence check based on a global norm. Thus, we have: 

Block Relaxation Factorization Phase: 



7- 


Block Relaxation Sweep Phase: 



7 + [2 Ce + Cr] a + [2 nrCE + Cr ] (3. 


If the linear systems involving the Jacobian are solved to a level of convergence which is inde- 
pendent of the granularity of the decomposition, then the number of Newton steps is independent of 
the type of decomposition and the number of processors.* We may therefore use the Newton step as 
the largest aggregate in the complexity model, and work out the complexity on a per-Newton step 
basis. Neglecting the overhead associated with possible problem-dependent damping and selection 
of time steps, each Newton step consists of one DAXPY, one norm, one function evaluation, l/N 
Jacobian evaluations and L relaxation sweeps, This count introduces the final two parameters of 
the model: the average number of Newton steps between Jacobian evaluations, JV , and the average 
number of relaxation sweeps per Newton step, X. For systems as complicated as (2.1) through 
(2.7), no theory exists from which either N or L can be specified as a function of n and p. We must 
therefore obtain N and L from problem-specific experiments. 

For present purposes, we will focus on an individual point in physical and numerical parameter 
space, leaving only the machine and network parameters free. (A more general approach is taken 
in [17].) We adopt the 16-species, 46-reaction mechanism of [22] for the oxidation of methane. Here 
r = 19 and the operation count coefficient Ca is determined from counting and relative timing 
measurements to be approximately 2.6 x 10 3 . L « 200 and N « 5 are also available from averaging 
over various runs of the serial code on a grid of approximately 1200 nonuniformly-spaced points. 

With the preceding formulae, we can estimate the execution time T p per Newton step of the 
p-processor parallelized solver through a relation of the form: 


T p - 7 fy + a fa + Pfp- 


(4.1) 


From the assigned values, we can extract the leading terms of each component of sum as: 


♦This condition may not be satisfied for relaxation methods in which convergence is based solely upon the size of the 
update, since the deteriorating condition number of the iteration matrix with finer granularity may allow a larger residual at 
convergence. 
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The linear appearance of L , the iteration count for the relaxation method, in the leading terms 
of the communication overhead puts a premium on keeping this parameter as small as possible 
for good parallel efficiency. 0/ 7 = 0(1) and a/ 7 = O(10 2 ) are typical. As p, which appears to 
a negative power in / 7 and to non-negative powers in C E and C R increases, the communication 
terms potentially dominate, rendering the parallel arithmetic capability worthless. For p greater 
than some Pcrit(<*/ 7 , /3/ 7 ), the phenomenon known as speed-doum can occur, in which adding 
processors increases wall-clock execution time. 

5. Alternative Linear Algebraic Methods 

In this section we compare a variety of linear algebraic methods on a series of four Jacobian- 
residual systems of the form 

6<f>W = -F(<t>W). 

The results for the iteration count and global inner product count are shown in Table 1. The first 
three Jacobians come from flame sheet problems on a 31 x 31 grid, the last on a more refined 63 x 63 
grid, with discrete dimensions of 2,883 and 11,907, respectively. The first is drawn from the first 
pseudo-transient Newton time step on the 31x31 grid, following interpolation from the Newton- 
converged solution on a previous (coarser) grid. A At of 1.0 x 10~ 5 was employed as a continuation 
parameter for this particular Newton step, which had an linear residual of O(10 4 ). The next derives 
from the 84 th and last pseudo-transient Newton step, after At had adaptively grown to 1.0 x 10 -1 . 
The third comes from a pure Newton step (At = 00 ). Finally, after interpolation of the converged 
solution to the 63 X 63 grid, a Jacobian was written based on an initial At of 1.0 x 10 -9 on a 
problem with a linear residual of O(10 8 ). Though this seems like an excessively conservative initial 
time step, the strong nonlinearities and steep gradients prevent the pseudo-transient procedure 
from converging if a much larger step is attempted. Three of the four linear algebraic methods 
presented below axe adequate for obtaining convergence of the linearized problem at larger time 
steps (affording less diagonal dominance), but the updated nonlinear residual norms will diverge 
during the pseudo-transient Newton stepping. 

Any linear system solution algorithm intended for use in fluid dynamical applications should 
be robust with respect both to asymmetry, to allow for the presence of convective terms, and 
indefiniteness, to allow for the presence of linearized source terms whose coefficients oppose the 
algebraic sign of the diagonal term of the discrete convective-diffusive operator. The latter can 
be particularly important in the modeling of chemically reacting flows in which at any point in 
the flow field some species may be created while others are consumed. In the flame sheet context, 
however, indefiniteness does not occur, since (2.8) contains no source terms. 

The first technique is block-line relaxation, the standard SOR method [14] (except for the 
misnomer when u, the over-relaxation parameter, is less than one). The Jacobian (denoted by A 
in this section), partitioned at the level shown in Fig. 4(b) is decomposed additively in the form 
A — D ~pL — Cy. In the limit of strong convection with upwind differencing, the lower triangular 
part C L is more influential than the upper triangular part Q/, and the splitting matrix {-D - C L ) 
is chosen, with no symmetrizing step. With L = D~ l C L and U = D~ l C v , the overalfiteration 
matrix may be written as (I - uL) 1 (uU + (1 - w)/), which form reveals how u> may be adjusted 
downward to gain control over a divergent problem. Of course, the application of D' 1 involves a 
block tridiagonal solve at the level of Fig. 4(c), in which a dense LU factorization is performed at 
the level of Fig. 4(d). 
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Table 1: Iteration count /, and inner product count IP 
for Jacobians from different stages of the flame sheet prob- 
lem, using the four linear algebraic techniques described in 
the text. Convergence for the block-line method was based 
on an absolute tolerance of 1.0 x 10 5 on the normalized 2- 
norm of the update vector. Convergence for the generalized 
minimum residual and Chebyshev iterations was based on a 
relative tolerance of 1.0 X 10” 5 on the 2-norm of the residual 
vector. Comparing the two convergence criteria, the block- 
line criterion was slightly lenient, in that its terminal relative 
residual reductions were less than five orders of magnitude in 
some cases. (A hypen indicates divergence. A double quote 
indicates a trivial repeated result.) 

We note that the block-line method is very effective in the first system, which permits over- 
relaxation, but becomes increasingly less so as the calculation evolves, and eventually fails to 
converge even at « = 0.5 in the largest problem. Since each iteration requires an inner product to 
evaluate the convergence criterion, I and IP are equal for this case. 

The second technique is the Generalized Minimum Residual (GMR) method [25] with an 
point incomplete LU factorization preconditioner, permitting no fill-in. The code PCGPAK [27] 
was employed for these runs. Given a system of equations, Mx = f, M nonsingular, and an initial 
iterate, x 0 , with initial residual, r 0 = / - Mx 0 , GMR computes the solution x from finding z E K k 
such that 

(r 0 - Mz,v ) = 0, 

for all t; G L k , and setting x = x 0 + z, where K k and L k are Krylov spaces based on r 0 : 

K k = span{r 0 , Mr 0 , . . . , M fc_1 r 0 }, L k = span{Mr 0 , M 2 r 0 , . . . , M k r 0 }. 

The solution x computed after k steps of GMR minimizes ||r|| 2 in the affine space x 0 +K k . In a prac- 
tical algorithm, an orthogonal basis for K k is built up by means of a Gram-Schmidt or Householder 
process, which obviates the necessity of working with the normal equations. Suitable computer 
implementations of GMR have been given in [25] and [31], of which the former is employed in 
these tests. Among the desirable properties of GMR are: (1) the only reference to M is in form of 
matrix- vector products, (2) it cannot break down (in exact arithmetic) short of delivering the solu- 
tion even for nonsymmetric systems with indefinite symmetric part, (3) it requires less storage and 
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fewer operations per step than the mathematically equivalent GCR and ORTHODIR algorithms, 
and (4) the 2-norm of the residual is non-increasing and can be monitored without constructing 
intermediate solution iterates. The main disadvantage of GMR is the lack of a bounded recurrence 
relation, which causes the operation count and storage requirements to grow quadratically and 
linearly, respectively, in the iteration index. We note in particular that k(k + l)/2 inner products 
are required in a A;-step procedure. In many applications, restarting GMR after a predetermined 
number of steps is amelioratory, but restarted GMR can also fail through stagnation. 

GMR is often uneconomical when left to act by itself on a general reaction- convection- diffusion 
operator. In an effort to control the work and storage required by GMR when A has a widely spread 
spectrum, we precondition the iterations by taking M in the formulae above to be Af? -1 , where 
B is an approximation to A whose inverse is relatively inexpensive to apply. This is “right” pre- 
conditioning, which first solves Mi = / for i, then Bx = x for x. We adopt right over left 
preconditioning because in the latter the matrix B enters into the GMR residual convergence crite- 
rion in a direct way, making convergence comparisons between different preconditioning techniques 
difficult. Approximate factorizations of the original matrix into triangular matrices, such as in- 
complete LU-decomposition (ILU) [21], are useful general purpose preconditioners for GMR and 
other Krylov methods. “Incomplete” refers to the fact that certain elements which would fill in 
and destroy the sparsity of A in a complete factorization are never computed or stored, saving 
operations and memory. The present examples employ the crudest of techniques appropriate to 
nonsymmetric systems, a Crout-Doolittle ILU(O), where the zero indicates that no fill-in outside 
the original sparsity pattern of A is allowed (see [32]). 

From the table, we note that GMR/ILU is more robust than the block-line method, but requires 
substantial memory and inner products in “difficult” problems. Two versions of GMR/ILU are 
presented. The K = 10 version is a restarted GMR which discards the Krylov subspace every 10 
iterations, constructs the new residual vector, and iterates again. Using a restart cycle reduces 
inner product count by keeping k small in the quadratic term. However, as in the fourth system 
presented in the table, overall iteration count can increase and take back some of this gain in the IP 
category. The K = 50 version is essentially K — oo in the cases considered, since I never exceeded 
50. The preconditioner is essential. As a baseline reference for the first and best conditioned case in 
the table, only two decades of residual reduction were obtained after 50 steps of unpreconditioned 
GMR, or after 200 steps of restarted ( K = 10) unpreconditioned GMR. 

The third technique is again GMR, but with a block preconditioning at the level of Fig. 4(c). 
This algorithm was hand-coded from scratch in anticipation of a parallel port (see next section). 
Block preconditioning exploits the fact that the tightest coupling in our systems is between un- 
knowns at the same gridpoint, which it handles exactly. The superiority of BILU over ILU is 
apparent in all table entries. Of course, the preconditioning matrix is itself more expensive to form 
and apply. 

The fourth technique is BILU-preconditioned adaptive Chebyshev iteration [19], suitable for 
cases in which the indefiniteness can be controlled and advantageous in parallel contexts because 
of its short recurrence relation. The code CHEBYCODE [1] was employed for these runs, supple- 
mented by the same block-ILU code used in the previous technique. 

From the initial iterate, Xo> the Chebyshev algorithm generates a sequence of iterates such 

that 

Tk - Pk{M)ro, 

where 

and Tjt is the k th Chebyshev polynomial of the first kind. The parameter d defines the center and 
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c the foci (at d ± c) of a family of confocal ellipses in the complex plane on which the Chebyshev 
polynomials are rapidly decreasing in norm as k increases. There is a smallest such ellipse which 
contains the spectrum of M. If it does not contain the origin, the sequence of residuals above is 
rapidly convergent. Like GMR, Chebyshev iteration deals directly with a nonsymmetric matrix 
Af, without reference to the normal equations. Unlike GMR, Chebyshev iteration requires the 
estimation of parameters describing the convex hull of eigenvalues of M . After an initial guess, 
these estimates are obtained adaptively using the modified power method labeled “Method 3” in 
[20] as a by-product of successive residuals. For a sub-sequence of four residuals, the cost of these 
estimates is 14 inner products per adaptive step. The present examples employ d = 1, c = 0 as the 
initial guess, which describes circles centered at the point z = 1. This is a reasonable estimate for 
a good preconditioner like block-ILU, since the spectrum of AB ~ 1 lies in a small region near 1. (In 
all the cases we have inspected, the real parts are bounded between 0.2 and 1.5 and the imaginary 
parts between ±0.4.) 

The table shows that Chebyshev/BILU is always less powerful than GMR/ILU in terms of 
iteration count (and hence matrix-vector products using M ), but always superior in terms of inner 
product count. Two versions of Chebyshev/BILU are presented. The C = 10 version forces an 
adaptive estimation of d and c every ten steps unless the iterations are converging sufficiently 
rapidly. The C = 20 version allows twice as many slowly converging steps between adaptations. 
(Divergence also invokes the the adaptive procedure in CHEBYCODE, but this did not occur in 
our examples.) Iteration count is not monotonic in C (unlike K in GMR), so problem-specific 
investigation is worthwhile. 

In summarizing the discussion of Table 1, block-preconditioned GMR and Chebyshev meth- 
ods have complementary strengths. The GMR version always requires fewer references to M and 
(roughly proportionally) less CPU time on a serial computer, even when restarted at intervals com- 
mensurate with Chebyshev adaptive cycles. However, the decision on which one to prefer in parallel 
depends ultimately on the communication- to-computation rates of the hardware in question, since 
the Chebyshev version is always superior in inner products. 

0. Parallel Experiments for Flame Sheet Jacobians 

We have seen in section 4 that domain decomposition of the Newton solver is trivial except 
for the solution of the linear systems, and in section 5 that GMR/BILU is the method of choice 
for our flame sheet Jacobians in the serial context. We also have from the complexity analysis 
that individual matrix-vector multiplies are no worse from a communication point of view than 
the residual vector evaluations required by other stages of Newton’s method. Therefore, provided 
that iteration counts can be kept low, say O(10r/A), it is only the preconditioner phase of the 
GMR algorithm that poses a parallel challenge. A global ILU can be a bottleneck in parallel 
implementations because of the sequential nature of triangular factorizations and solves. Though 
wavefront-based or red-black reorderings of the standard sequential operations can alleviate this 
problem [26] in the context of sparse banded matrices, domain-decomposition approaches side-step 
it altogether by applying ILU within subdomains only. We now describe two domain- decomposed 
implementations of GMR/BILU. Both consist of creating separate BILU preconditioners for each 
subdomain and building them into a global preconditioner which is basically block-diagonal at the 
subdomain level. The two methods differ in that one of the preconditioners is allowed nonzero 
border blocks to provide for the flow of information between subdomains in the preconditioner 
phase. 

The GMR iterations are already “inner iterations” from the point of view of the overall non- 
linear process. To avoid any deeper nesting of iterations in applications, we choose a priori to 
iterate simultaneously on all of the unknowns in the linear system, in the sense that the subdomain 
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Figure 5: Sparsity patterns for two different decompositions 
of a rectangular region into two strips using a 9-point finite- 
difference template: (a) without edge; (b) sparsity pattern 
of A for (a); (c) with edge; (d) sparsity pattern of A for 
(c). The B operators have the asterisked blocks removed 
for parallelism. Within each subdomain the gridpoints are 
ordered lexicographically. 


problems are not individually iterated to convergence before their values are used to update the 
right-hand sides of the equations for the interfacial unknowns. This form of full matrix domain 
decomposition was advocated in [5] for problems in which no fast solver is known for the subdomain 
interior problems, and has been demonstrated therein to lead to an optimally convergent scheme 
for a class of self-adjoint strongly elliptic operators, provided (among other things) that spectrally 
equivalent subdomain preconditioners are employed. 

The best paradigm for domain decomposition is the decomposition into two strips of a rectan- 
gular region overlaid by a tensor-product grid. In one case the cut passes “between” grid points; in 
the other, it follows a line of gridpoints, which are ordered separately. For a 9-point operator on a 
grid with 16 interior subintervals in each direction (with Dirichlet boundary conditions eliminated) 
the resulting sparsity pattern for the operators A and B indicated graphically in Fig. 5. 
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The matrix of Fig. 5(b) can be written 


A = ( 1“ a\1 ) ’ with a P reconditioner5 = ( A o i 22 ) ( 6 - x ) 

where the A u are ILU approximations. In this method, the creation of independent threads of 
computation in the preconditioner comes at the expense of severing the interdomain coupling 
altogether. It therefore parallelizes with virtually no overhead, but potentially suffers in iteration 
count as the granularity of the decomposition is refined and more and more of the coupling is 
discarded. Its limit is block-point diagonal preconditioning. Some experiments with this technique 
have been reported in [2] and [3], and we have also tested it on all of the Jacobians of the previous 
section. It is referred to as the “uncoupled” preconditioning in the tables to follow. 

The matrix of Fig. 5(d) is 

( All 0 ^13 \ 

A = 1 0 A 22 A 23 J • (6*2) 

\ A 31 A 32 A 33 / 

Here, A 33 renders the coupling between the points on the interface itself. The conformally parti- 
tioned preconditioning matrix we propose for A is 

( Ml 0 4 13 \ 

B = ( 0 A 22 A 23 I ? (6-3) 

V 0 0 C ) 

where C approximates the Schur complement of An and A 22 in A. The exact Schur complement, 
C, may be obtained from block-Gaussian elimination on A as: 


C = A 33 - A 3 iA 11 1 Ai 3 — A32A 22 1 A23* (6*4) 

The tilde- notation in the definition of B accommodates the replacement, if convenient, of the ex- 
act Aij with approximations thereto. We assume throughout that the A„ are invertible. (This 
is certainly a reasonable requirement for a discrete convective- diffusive operator.) Under this as- 
sumption, C is also invertible [ 8 ]. This technique is referred to as the “Modified Schur Complement 
(MSC)-coupled” preconditioning. For further details on the motivation and construction of MSC- 
interface preconditioning, see [7, 15, 18]. For present purposes it is important only to note that the 
MSC-coupled version is hierarchical; that is, it is built on top of the subdomain preconditioning 
and requires a small number of ILU solves on each subdomain to construct. If the subdomain 
preconditioning is poor, there is no reason to expect that the MSC preconditioning of the interfaces 
is good. Note that the inverse of either B can be applied with one solve in each subdomain. 

Tables 2 through 5 in this section report parallel results on the four Jacobian systems considered 
above. 

The first case, the initial pseudo-transient step on the 31 X 31 grid, is solved with each pos- 
sible number of processors between 1 and 16. One notes that the iteration count grows mono- 
tonically with the granularity of the decomposition for the uncoupled version, a degradation of 
convergence penalty for throwing away information. Somewhat surprisingly, the iteration count 
has non-monotonic behavior in the coupled case. The fact that the 8 -subdomain case requires 
fewer iterations than the global algorithm causes super-linear speed-up; that is the 8 -processor 
code executes more than 8 times as fast as the 1 -processor code. This effect is due to an interaction 
between the MSC and ILU preconditioners, and eventually is swamped by the loss of the “chunky” 
subdomain condition at high processor numbers. The actual parallel efficiencies e are plotted in 
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Fig. 6. In order to separate iteration count effects from synchronization and communication effects, 
a third .curve appears in this figure. For the third curve, which is for the MSC-coupled case, the 
iteration count has been fixed at the p = 1 total of 6, independent of residual convergence. The 
peaks of this curve occur at p = 2 d , at which all of the subdomain interiors have the same discrete 
thickness. (For p = 2 d , there are d + 1 subdomains of exactly 2 5 ~ d - 1 rows each.) At all other p, 
there are load imbalances between the subdomains which allow idling. 
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Table 2: Iteration count /, CPU time T , and efficiency e 
for Jacobians from the initial pseudo-transient Newton step 
of the flame sheet problem with horizontal strips on a 31 x 31 
grid as a function of number of processors p. 

For the next case, the final pseudo-transient step on the 31 x 31 grid, and subsequently, 
attention is focused on iteration behavior, by considering p = 2 d only. One notes that the iteration 
count grows monotonically and rather badly for both preconditioners; however, wall-clock speed- 
ups are obtained all the way to the limit of 16 processors. Neither method is preferable at all 
values of p; however, the uncoupled one is surprisingly good. This competitive behavior of the 
uncoupled algorithm is not generally found with scalar convective-diffusive systems (see [18]), since 
spatial coupling is dominant in such problems. In reacting flows, the grid-point blocking is enough 
for a decent preconditioning. In fact, the attempt of the MSC preconditioning to account for 
spatially-connected neighbors interacts negatively with the BILU in several cases. 

For the third case, the full Newton step on the 31 x 31 grid, all of the conclusions of the previous 
case are repeated. The efficiencies are the lowest of any of the four cases due to degradation of 
convergence, but speed-ups do occur all the way up to p = 16. 

The final case is a reprise of the initial one. The MSC-coupled version displays a retrograde 
iteration count, and very high terminal efficiencies are obtained, in spite of the loss of “chunkiness”. 

All of the decompositions presented in this section were for strips normal to the axial direc- 
tion. Cases for strips normal to the radial direction were also run, but were always inferior for a 
corresponding number of subdomains. This is a natural reflection of the tight radial coupling in 
the physics of the problem, which is preserved in strips which span from the axis to the confining 
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Figure 6: Graphs of efficiency versus number of processors 
for the solving the first linear system at the first pseudo- 
transient Newton step on the 31 X 31 flame sheet problem. 
The uniform dashed curve is for the uncoupled algorithm 
and the patterned dashed curve for the MSC-coupled version. 
The solid curve is for a reference case in which the number 
of iterations has been fixed at a constant value of 6. 
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Table 3: Iteration count /, CPU time T, and efficiency e for 
Jacobians from the final pseudo- transient Newton step of the 
flame sheet problem with horizontal strips on a 31 x 31 grid 
as a function of number of processors p. 
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Table 4: Iteration count /, CPU time T, and efficiency e for 
Jacobians from a full Newton step of the flame sheet problem 
with horizontal strips on a 31 X 31 grid as a function of number 
of processors p. 
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Table 5: Iteration count I, CPU time T, and efficiency e 
for Jacobians from the initial pseudo- transient Newton step 
of the flame sheet problem with horizontal strips on a 63 X 63 
grid as a function of number of processors p. 

wall. It is also an argument against box-wise decompositions, which are desirable in more isotropic 
problems from a surface- to- volume ratio point of view. 

Summarizing the results of this section, GMR/BILU has acceptable behavior in all representa- 
tive stages of a flame sheet calculation, but guidelines on when the interface coupling has sufficient 
incremental value to warrant its use are still lacking. 

7. Conclusions and Further Directions 

We conclude that domain decomposition is a practical means of getting appreciable speed- 
ups (~ 10) on a 16-processor array for representative two-dimensional finite-differenced systems 
of PDEs. The generalized minimum residual method with point-blocked ILU preconditioning is 
powerful in this context, but places more demands on the global communication properties of the 
interprocessor network than a Chebyshev method with the same preconditioning. 

Further research is suggested in areas of general algorithmic interest and those specific to 
reacting flow systems. 

A hybrid GMR-Chebyshev algorithm has been suggested in [10], which would seem very well 
suited to parallel implementation, since it combines low inner product count of Chebyshev iteration 
with an eigenvalue estimation scheme using a GMR-Arnoldi method, in which even the inner 
products themselves are by-products of an iteration-advancing step. 

It would be of interest to compare our subdomain-based parallel ILU preconditioners to those 
in which the factorization (block-point or otherwise) is carried out over an entire domain. Since 
no interfacial blocks requiring approximation are introduced, it is expected that this technique, 
appropriately blocked, will have a better iteration count than subdomain-based techniques on most 
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problems; however, its parallel efficiency can be so low in some problems (due to synchronization 
delays) that it is inferior even to unpreconditioned GMR [4] as a parallel algorithm. A wide range 
of comparisons are needed here. 

There is no consideration of complex domain geometry in the present work although ease of 
generalization to this case is an equally relevant motivation. The case of problems requiring too 
much memory to be managed by just one processor is another one in which the proposed technique is 
attractive, despite possible iteration count degradation relative to a global technique. Comparisons 
to non-domain decomposition-based techniques for non-tensor product grids and out-of-core solvers 
would be relevant on large scientific and engineering problems. 

Finally, from the applications point of view, Jacobians from detailed kinetics reacting flow mod- 
els should be studied. The regime of many components per gridpoint (r « 20 rather than r = 3) is 
expected to further bring out of the advantages of block-point preconditioning, reduce the impor- 
tance of accurate interfacial coupling in the overall iteration count, and reduce the communication- 
to-computation ratio by adding zero-space-dimensional work. Verification of these statements will 
mean that future turn-around times for multidimensional reacting flow calculations on currently 
available supercomputers will be measured in hours rather than days. 
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